Skip to content

Conversation

Tim-Brooks
Copy link
Contributor

Currently Elasticsearch is using StandardCharsets#decode and encode
methods when working with optimized text. These variants are not as
performant as the direct implementations in String when working with
byte[]. If we are going to one-shot convert without validation then the
String variants should be preferred.

Currently Elasticsearch is using StandardCharsets#decode and encode
methods when working with optimized text. These variants are not as
performant as the direct implementations in String when working with
byte[]. If we are going to one-shot convert without validation then the
String variants should be preferred.
@Tim-Brooks Tim-Brooks requested a review from a team as a code owner October 6, 2025 19:09
@Tim-Brooks Tim-Brooks added >non-issue :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. v9.3.0 labels Oct 6, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing Meta label for Distributed Indexing team label Oct 6, 2025
@Tim-Brooks
Copy link
Contributor Author

Tim-Brooks commented Oct 6, 2025

I noticed this a few months ago when I was benchmarking something and it seemed to be we should switch variants. I can't think of a scenario where we would prefer the byte buffer encoder variant for String -> bytes. I could see the scenario where we would prefer the byte buffer variant for byte -> String if we wanted to catch UTF-8 errors. However, the default malformed input variant uses replace which is the same as the String ctor variant.

    public final CharBuffer decode(ByteBuffer bb) {
        try {
            return ThreadLocalCoders.decoderFor(this)
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .decode(bb);
        } catch (CharacterCodingException x) {
            throw new Error(x);         // Can't happen
        }
    }

I would note that this is probably different behavior than Jackson would give prior to this change. Are we okay with silently replacing invalid UTF-8 input?

@jordan-powers - added the code
@felixbarny - recently worked with the code
@martijnvg - reviewed a change a did a while back related to optimized input

@Tim-Brooks
Copy link
Contributor Author

UTF8StringBytesBenchmark.getBytesByteBufferEncoder           uuid  avgt    3    36.610 ±   2.235  ns/op
UTF8StringBytesBenchmark.getBytesByteBufferEncoder          short  avgt    3    36.502 ±   5.117  ns/op
UTF8StringBytesBenchmark.getBytesByteBufferEncoder           long  avgt    3   122.581 ±  10.805  ns/op
UTF8StringBytesBenchmark.getBytesByteBufferEncoder       nonAscii  avgt    3   244.617 ± 101.226  ns/op
UTF8StringBytesBenchmark.getBytesByteBufferEncoder       veryLong  avgt    3  1149.046 ±  82.400  ns/op

UTF8StringBytesBenchmark.getBytesJDK                         uuid  avgt    3     3.772 ±   1.984  ns/op
UTF8StringBytesBenchmark.getBytesJDK                        short  avgt    3     3.723 ±   1.799  ns/op
UTF8StringBytesBenchmark.getBytesJDK                         long  avgt    3     6.737 ±   3.997  ns/op
UTF8StringBytesBenchmark.getBytesJDK                     nonAscii  avgt    3   134.981 ±  22.179  ns/op
UTF8StringBytesBenchmark.getBytesJDK                     veryLong  avgt    3    31.704 ±   0.360  ns/op

UTF8StringBytesBenchmark.getBytesUnicodeUtils                uuid  avgt    3    29.224 ±  23.463  ns/op
UTF8StringBytesBenchmark.getBytesUnicodeUtils               short  avgt    3    29.634 ±  23.561  ns/op
UTF8StringBytesBenchmark.getBytesUnicodeUtils                long  avgt    3    44.193 ±   8.370  ns/op
UTF8StringBytesBenchmark.getBytesUnicodeUtils            nonAscii  avgt    3   196.680 ±   9.583  ns/op
UTF8StringBytesBenchmark.getBytesUnicodeUtils            veryLong  avgt    3   417.743 ±  14.938  ns/op

UTF8StringBytesBenchmark.getStringByteBufferDecoder          uuid  avgt    3    20.832 ±   0.228  ns/op
UTF8StringBytesBenchmark.getStringByteBufferDecoder         short  avgt    3    20.872 ±   0.444  ns/op
UTF8StringBytesBenchmark.getStringByteBufferDecoder          long  avgt    3    27.744 ±   0.130  ns/op
UTF8StringBytesBenchmark.getStringByteBufferDecoder      nonAscii  avgt    3   174.942 ±  36.993  ns/op
UTF8StringBytesBenchmark.getStringByteBufferDecoder      veryLong  avgt    3   126.116 ±  13.467  ns/op

UTF8StringBytesBenchmark.getStringJDK                        uuid  avgt    3     6.353 ±   0.564  ns/op
UTF8StringBytesBenchmark.getStringJDK                       short  avgt    3     6.497 ±   2.621  ns/op
UTF8StringBytesBenchmark.getStringJDK                        long  avgt    3    10.237 ±   0.547  ns/op
UTF8StringBytesBenchmark.getStringJDK                    nonAscii  avgt    3   161.460 ±  13.172  ns/op
UTF8StringBytesBenchmark.getStringJDK                    veryLong  avgt    3    34.948 ±   2.518  ns/op

Copy link
Contributor

@jordan-powers jordan-powers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for looking into this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. >non-issue Team:Distributed Indexing Meta label for Distributed Indexing team v9.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants